Abstract

Infrastructure-as-a-Service (IaaS) providers need to offer richer services to be competitive while optimizing their
resource usage to keep costs down. Richer service offerings include new resource request models involving bandwidth 
guarantees between virtual machines (VMs). Thus we consider the following problem: given a VM request graph 
(where nodes are VMs and edges represent virtual network connectivity between the VMs) and a real data center 
topology, find an allocation of VMs to servers that satisfies the bandwidth guarantees for every virtual network 
edge — which maps to a path in the physical network — and minimizes congestion of the network. 

Previous work has shown that for arbitrary networks and requests, finding the optimal embedding satisfying
bandwidth requests is $\mathcal{NP}$-hard. However, in most data center architectures, the routing protocols employed are
based on a spanning tree of the physical network. In this paper, we prove that the problem remains $\mathcal{NP}$-hard even
when the physical network topology is restricted to be a tree, and the request graph topology is also restricted. We
also present a dynamic programming algorithm for computing the optimal embedding in a tree network which runs in
time $O(3^k n)$, where $n$ is the number of nodes in the physical topology and $k$ is the size of the request graph. This is
well suited to practical requests with small $k$, which form a large class of web-service and enterprise
workloads. Also, if we restrict the request topology to a clique (all VMs connected to a virtual switch with uniform
bandwidth requirements), we show that the dynamic programming algorithm can be modified to output the minimum
congestion embedding in time $O(k^2 n)$.



1 Introduction 



Infrastructure-as-a-Service (IaaS) providers like Amazon [htta], Rackspace [httd] and Go-grid |hh| provide computing
and other services on demand and charge based on usage. This has resulted in the commoditization of computing
and storage. Typically, these providers offer service level agreements (SLAs) [htte] in which they guarantee the type
of virtual machines (VMs) that they provide and the amount of disk space available to these VMs. Although some 
providers offer additional services like dedicated firewalls and load-balancers, no network performance guarantees are 
provided, which are critical for workloads like content distribution networks, desktop virtualization, etc. Given the 
rapid growth and innovation in these services fhcbiglc], it is important for service providers (SPs) to offer innovative 
service models for differentiation, e.g., by offering richer network SLAs to be competitive while optimizing their 
resource usage to keep costs down. 

Next generation cloud services will require improved quality of service (QoS) guarantees for application workloads.
For example, multi-tier enterprise applications [httb] require network isolation and QoS guarantees such as
bandwidth guarantees, and for over-the-top content distribution using a cloud infrastructure, bandwidth, jitter and
delay guarantees are important in determining performance. Similar guarantees are necessary for MapReduce-based
analytics workloads too. Moreover, networking costs are currently a significant fraction of the total infrastructure
cost in most data center (DC) designs [GT11, httc], since servers are cheap compared to core switches and routers.
Thus, in order to provide richer network SLAs, it is important for SPs to ensure that networking resources are
efficiently utilized while at the same time ensuring low congestion (which leads to better load balancing and more room for
overprovisioning).

In this paper we consider a virtualization request model in which clients can request bandwidth guarantees between 
pairs of virtual machines (VMs) [GLW+10], for which SPs will allocate resources within their infrastructure. This
naturally leads us to study the following resource allocation problem: given a VM request graph — where nodes are 
VMs and edges represent virtual network connectivity between the VMs — and a real data center topology, find an 
allocation of VMs to servers that satisfies the bandwidth guarantees for every virtual network edge and minimizes 
congestion of the network. Note that in this setting, each virtual edge maps to a path in the physical network topology. 

The above request graph model is driven by application workloads that execute on top of network infrastructure 
provided by the SPs. Common workloads include enterprise applications [httb], MapReduce [DG08], and web hosting,
and different workloads can lead to different service models. For instance, many web services request a small number 
of VMs to implement the web servers, the application servers, and the database. The VM implementing the web server 
receives a request and forwards it to an application server VM, which in turn queries the database server VMs. In such 
cases, specific bandwidth guarantees between the outside world and the web server, the web server and the application 
server, and so on, are important to ensure QoS. In MapReduce workloads on the other hand, it has been shown that 
network optimization can yield better results than adding machines IhsinfmclOL but in this setting since all the VMs 
implementing map and reduce tasks communicate with each other via data shuffle, the aggregate bandwidth available 
to the VMs may determine the application performance. 

A number of metrics have been studied to measure the network load including congestion, jitter, delay, hop count, 
or a combination of the above. Here we focus on minimizing congestion, but we also note that our algorithmic 
techniques are generic and can easily be adapted to optimize other metrics. 

It has been shown previously that the problem of embedding virtual requests in arbitrary networks is $\mathcal{NP}$-hard
[CRB09, GLW+10]. However, in most data center networks, the routing protocols used rely on a spanning tree of the
physical network [httc]. Hence, in this paper we study the problem of minimizing network congestion while allocating
virtual requests when the network topology is restricted to be a tree.

1.1 Our Contributions 

First, we prove that optimally allocating VMs remains $\mathcal{NP}$-hard even when both the physical network topology
and request topology are highly restricted. We show that if the network topology is a tree, then even for simple
request topologies like weighted paths, with the weights signifying the amount of bandwidth required between the
corresponding VMs, it is $\mathcal{NP}$-hard to approximate the minimum congestion to a factor better than $O(\theta)$, where $\theta$ is
the ratio of the largest to smallest bandwidth requirements in the path request. We also show that in the unweighted
case (uniform bandwidth requirement on all edges) the problem is $\mathcal{NP}$-hard to approximate to within a factor of
$O(n^{1-\epsilon})$ for any $\epsilon \in (0, 1)$, even when the request topology is a tree.

Given these complexity results, we cannot hope for an efficient algorithm for all instances of the problem. However,
we note that in practice, many workloads consist of a small number of VMs allocated in a huge datacenter.
Accordingly, our second result is a dynamic programming algorithm (Algorithm 2) for computing the minimum congestion
embedding of VMs in a tree network for any request graph, which satisfies the pairwise bandwidth requirements and
runs in time $O(3^k n)$, where $n$ is the number of nodes in the physical topology and $k$ is the number of VMs in the
request graph. Enterprise workloads often consist of small requests with specific bandwidth requirements between
VMs. For these instances the exponential $3^k$ term is quite small, so they can be served optimally by our
algorithm, whose run time is only linear in the network size.

Third, workloads like MapReduce jobs have too many VMs to use an algorithm with a runtime of $O(3^k n)$, but
these have uniform bandwidth requirements between the VMs [MPZ10], and we show that the exponential dependence
on $k$ can be removed when the request network is uniform. For the special case in which the requests are restricted
to be cliques or virtual clusters [BCKR11], we propose an algorithm that finds the minimum congestion embedding
in $O(k^2 n)$ time (Algorithm 3). Hence our algorithms yield the minimum congestion embeddings of virtualization
requests for several common use cases.

We also present simulations which validate our results for common request models and practical network
configurations.

1.2 Outline of the paper 

The paper is organized as follows. We first review previous work in Section 2 and formally define the problem and
notation in Section 3. We prove the hardness results in Section 4, followed by the algorithms in Section 5. In Section 6
we provide simulations, which validate the running time and correctness of our algorithms. Finally, we conclude and
point to future work in Section 7.

2 Related Work 

Previous work has shown that the problem of embedding virtual request graphs in arbitrary physical networks is
$\mathcal{NP}$-hard [CRB09, GLW+10]. A number of heuristic approaches have been proposed, including mapping VMs to nodes
in the network greedily and mapping the flows between VMs to paths in the network via shortest paths and
multi-commodity flow algorithms [FA06, ZA06]. However, these approaches do not offer provable guarantees and may lead
to congested networks in some circumstances. The authors of [CRB09] assume network support for path-splitting
[YYRC08] in order to use a multi-commodity flow based approach for mapping VMs and flows between them to the
physical network, but this approach is not scalable beyond networks containing hundreds of servers [GLW+10].

Guo et al. [GLW+10] proposed a new architectural framework, SecondNet, for embedding virtualization requests
with bandwidth guarantees. This framework considers requests with bandwidth guarantees $f_{ij}$ between every pair
of VMs $(v_i, v_j)$. It provides rigorous application performance guarantees and hence is suitable for
enterprise workloads, but the same work also establishes the hardness of finding such embeddings in
arbitrary networks. Our results employ the SecondNet framework but restrict attention to tree networks.

Very recently, Ballani et al. [BCKR11] have described a virtual cluster request model, which consists of requests of
the form $(k, B)$ representing $k$ VMs each connected to a virtual switch with a link of bandwidth $B$. A request $(k, B)$
can be interpreted (although not exactly) as a clique request on $k$ VMs with a bandwidth guarantee of $B/(k-1)$ on each
edge of the clique. They describe a novel VM allocation algorithm for assigning such requests on a tree network with
the goal of maximizing the ability to accommodate future requests. For each node $v$ in the tree network $T$, they maintain
an interval of values that represents the number of VMs that can be allocated to the subtree $T_v$ rooted at $v$ without congesting the uplink
edge from $v$, and allocate VMs to sub-trees greedily. We generalize this approach to the case of virtualization requests
in the SecondNet framework and we use a dynamic programming solution in order to find the optimal minimum
congestion embedding. By restricting the requests to virtual clusters, [BCKR11] offers a tradeoff to the providers
between meeting specific tenant demands and flexibility of allocation schemes. In this work, we explore this tradeoff
further and show that it is possible to formulate flexible allocation schemes even in the SecondNet framework for small
requests.

The problem of resource allocation has also been studied in the virtual private network (VPN) design setting where 
bandwidth guarantees are desired between nodes of the virtual network [DGG+99, GKK+01]. In this setting, a set
of nodes of the physical network representing the VPN endpoints is provided as the input, and the task is to reserve 
bandwidth on the edges of the network in order to satisfy pairwise bandwidth requirements between VPN endpoints. 
The fixed location of VPN endpoints makes this problem significantly different from that of embedding virtualization 
requests in a network, since the latter involves searching over all possible embeddings of the VMs in the network. 

3 Preliminaries 

An instance of our problem consists of a datacenter network and a request network. The datacenter network $N$ is a
tree on $n$ nodes rooted at a gateway node $g$. Edges $e$ in $N$ have capacities $c_e$ representing their bandwidth. Let $L$ denote
the set of leaves of $N$.

The request network $G_R$ is an arbitrary, undirected graph on $k + 1$ nodes. Nodes in $G_R$ consist of a set $V$ of $k$
virtual machines $v_1, \ldots, v_k$ and a special gateway node $g$. Edges $e$ in the request graph specify bandwidth guarantees
$f_e$ (flow requirements) and are divided into two types: edges of type-I have the form $e = (v_i, g)$ and specify a
requirement for routing $f_e$ flow between $v_i$ and the gateway node $g$ (uplink bandwidth to the outside world), and edges
of type-II have the form $e = (v_i, v_j)$ and specify flows between two virtual machines $v_i$ and $v_j$ ("chatter" bandwidth
between virtual machines). We use $R^{I}$ and $R^{II}$ to denote the sets of type-I and type-II edges and $R = R^{I} \cup R^{II}$ to
denote all edges.

A solution consists of an embedding $\pi : V \to L$ mapping virtual machines onto leaves in the datacenter network.
For simplicity we will assume only a single VM can be mapped to each leaf, although it is easy to modify our algorithm
so that each datacenter node $v$ can support up to $n_v$ VMs. The gateway node $g$ in $G_R$ is always mapped to the gateway
in $N$. If $\pi$ maps the endpoints of edge $e = (v_i, v_j)$ (equivalently $e = (v_i, g)$) onto $\pi(v_i)$ and $\pi(v_j)$, then $e$ contributes
$f_e$ flow to every edge along the path $P_{\pi(v_i),\pi(v_j)}$ between $\pi(v_i)$ and $\pi(v_j)$ in $N$. The congestion of an edge $e$ in
$N$ under embedding $\pi$ is the total flow routed over $e$ divided by its capacity,

$$\mathrm{Cong}(\pi, e) = \frac{1}{c_e} \sum_{e' = (u, v) \in R \,:\, e \in P_{\pi(u),\pi(v)}} f_{e'},$$

and our goal is to find $\pi$ minimizing $\max_{e \in N} \mathrm{Cong}(\pi, e)$.
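To make the objective concrete, the following is a minimal sketch of how the congestion of a given embedding could be evaluated on a rooted tree. The data layout (a `parent` map, capacities keyed by the lower endpoint of each edge, and the function name `congestion`) is an assumption made for illustration and is not taken from the paper.

```python
from collections import defaultdict

def congestion(parent, capacity, requests, embedding, gateway="g"):
    """Return the maximum over tree edges of (flow on the edge) / capacity.

    parent:    dict child -> parent for every non-root node of N
    capacity:  dict child -> capacity of the edge (child, parent[child])
    requests:  dict (a, b) -> bandwidth f_e; b may be the gateway
    embedding: dict VM -> leaf of N (the map pi); the gateway maps to itself
    """
    def path_to_root(x):
        # Each edge is identified by its lower endpoint.
        edges = []
        while x in parent:
            edges.append(x)
            x = parent[x]
        return edges

    load = defaultdict(float)
    for (a, b), f in requests.items():
        pa = embedding.get(a, gateway)
        pb = embedding.get(b, gateway)
        ea, eb = set(path_to_root(pa)), set(path_to_root(pb))
        # In a tree, the path between pa and pb consists of exactly the edges
        # that lie on one root path but not on the other.
        for e in ea ^ eb:
            load[e] += f
    return max((load[e] / capacity[e] for e in load), default=0.0)
```

For example, with two racks under the gateway, a chatter request of 5 between VMs in different racks and an uplink request of 2, the edges on the first VM's root path carry 7 units in total.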

4 Hardness results 

In this section we show that the embedding problem is $\mathcal{NP}$-hard even with restricted topologies of the host and
request graphs. In particular, we show that the problem of embedding a weighted path request, which is perhaps the
simplest weighted request topology, is $\mathcal{NP}$-hard to approximate to a factor better than $O(\theta)$, where $\theta$ is the ratio of
the largest to smallest bandwidth requirements. Furthermore, we show that in the unweighted case the problem is
$\mathcal{NP}$-hard to approximate to a factor smaller than $O(n^{1-\epsilon})$ for any constant $\epsilon \in (0, 1)$, where $n$ is the number of VMs
in the request, even for the case when the request topology is a tree.

Both of our reductions are from 3-partition. An instance of 3-partition consists of a multiset $S = \{s_1, \ldots, s_{3m}\}$
of $3m$ integers summing to $mB$, and the goal is to determine whether $S$ can be partitioned into $m$ subsets $S_1, \ldots, S_m$
such that the sum of the elements in each $S_i$ is equal to $B$ and $|S_i| = 3$ for all $i$. Crucially, 3-partition remains
$\mathcal{NP}$-complete even when the sizes of the integers are bounded by a polynomial in $m$:

Theorem 1 ([GJ79]). The 3-partition problem is strongly $\mathcal{NP}$-complete, even when $B/4 < s_i < B/2$ for all $i$,
forcing any partition to consist of triples.

4.1 Weighted topologies 

Theorem 2. The embedding problem is $\mathcal{NP}$-complete even when restricted to instances where the request graph is a
weighted path, and the host network is a tree. Moreover, it is $\mathcal{NP}$-hard to approximate to within a factor better than
$\theta/6$, where $\theta$ is the ratio of the largest to smallest weight in the request graph.

Proof. First, the problem is in $\mathcal{NP}$, since given a candidate embedding, it is easy to verify that its congestion is at
most 1.

Now, let $S = \{s_1, \ldots, s_{3m}\}$ be a multiset of $3m$ integers summing to $mB$, constituting an instance of 3-partition,
such that $B/4 < s_i < B/2$ for all $i$. Let $T$ be a tree of height two. The root/gateway $g$ has $m$ children labeled
$S_1, \ldots, S_m$, each of which has $B$ children of its own. Since 3-partition is strongly $\mathcal{NP}$-complete, we may assume that
$B$ is bounded by a polynomial in $m$, so $T$ has polynomial size. All edges from $g$ to the $S_i$ have capacity 6. Each node
$S_i$ is connected to each of its $B$ children by edges of capacity $W > 6$.

Let $R = R^{I} \cup R^{II}$ be defined as follows. Let $V = \{v_1, \ldots, v_{mB}\}$ be a set of VMs. For $j = 1, \ldots, 3m + 1$ let
$q_j = \sum_{i=0}^{j-1} s_i$, where we set $s_0 = 0$ for convenience (note that $q_1 = 0$ and $q_{3m+1} = mB$). Further, define heavy
intervals as $I_j = \{v_{q_j + 1}, \ldots, v_{q_{j+1}}\}$, $j = 1, \ldots, 3m$, so that $|I_j| = s_j$.









Define chatter bandwidth requests $f_{ij}$ between consecutive VMs ($j = i + 1$) by setting

$$f_{ij} = \begin{cases} W, & \text{if } \{v_i, v_j\} \subseteq I_k \text{ for some } k, \\ 1, & \text{otherwise.} \end{cases}$$

Define uplink bandwidths as $f_i = 1$ for $i = 1$ and $f_i = 0$ otherwise. Thus, the requests form a path with the
first node on the path connected to the gateway node. The path is partitioned into intervals of length $s_j$, such that
the bandwidth requirement between consecutive nodes in each interval is high and the requirement between adjacent
nodes on the path that belong to different intervals is low. We refer to the edges of weight $W$ as heavy edges and the
edges of weight 1 as light edges.
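The request construction above can be sketched in a few lines. The function name `path_request_from_3partition` and the list-based representation of the path are hypothetical choices for illustration; only the prefix sums $q_j$, the heavy intervals $I_j$ and the weights $W$ and 1 come from the text.

```python
def path_request_from_3partition(s, W):
    """Build the chatter weights of the path request used in the reduction.

    s: list of the 3m integers s_1, ..., s_3m
    W: weight of a heavy edge (the text assumes W > 6)
    Returns (weights, intervals), where weights[i - 1] is the bandwidth
    between v_i and v_{i+1}, and intervals[j - 1] lists the 1-based VM
    indices in the heavy interval I_j.
    """
    q = [0]
    for sj in s:
        q.append(q[-1] + sj)  # prefix sums q_1 = 0, ..., q_{3m+1} = mB
    intervals = [list(range(q[j] + 1, q[j + 1] + 1)) for j in range(len(s))]
    owner = {}
    for j, interval in enumerate(intervals):
        for v in interval:
            owner[v] = j
    total = q[-1]
    # Consecutive VMs in the same interval share a heavy edge, others a light one.
    weights = [W if owner[i] == owner[i + 1] else 1 for i in range(1, total)]
    return weights, intervals
```

For a toy input such as `s = [2, 3, 1]` the path has five edges, of which the two edges crossing interval boundaries are light.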

If $S$ has a 3-partition, then the heavy intervals $I_j$ can be divided into $m$ sets $\mathcal{V}_1, \ldots, \mathcal{V}_m$ of 3 intervals each, such
that the sum of the lengths within each $\mathcal{V}_i$ is exactly $B$. We can map all VMs in $\mathcal{V}_i$ to the children of node $S_i$. Each
edge $(g, S_i)$ carries flow from at most 2 light edges on the border of each of the 3 heavy intervals in $\mathcal{V}_i$, that is, at most 6 units against a capacity of 6, and each edge
connecting $S_i$ to its children has load at most $W$, for a congestion of 1.

Now suppose that $S$ does not have a 3-partition. Then, since by assumption $B/4 < s_i < B/2$, in any feasible
allocation of VMs at least one heavy interval $I_k$ must be divided between children of different nodes $S_i$ and $S_j$, and
hence at least one heavy edge must congest the edge $(g, S_i)$, yielding congestion at least $W/6$.

Thus, it is $\mathcal{NP}$-hard to distinguish between instances with an optimal congestion of 1 and $W/6 = \theta/6$, where $\theta$ is
the ratio of the largest and the smallest weight in the request graph, i.e. $\theta = W/1$. □

4.2 Unweighted topologies 

Theorem 3. Let $n$ denote the number of leaves in the host tree. The embedding problem is $\mathcal{NP}$-complete and
$\mathcal{NP}$-hard to approximate to within a factor better than $\Omega(n^{1-\epsilon})$, for any $\epsilon \in (0, 1)$, when the set of requests forms an
unweighted tree.

Proof. As before, we first note that the problem is in $\mathcal{NP}$, since given a candidate embedding, it is easy to verify
that its congestion is at most 1. We use a reduction from 3-partition similar to the reduction to Maximum Quadratic
Assignment used in [HLS09].

Let $S = \{s_1, \ldots, s_{3m}\}$ be a multiset of $3m$ integers summing to $mB$, constituting an instance of 3-partition, such
that $B/4 < s_i < B/2$ for all $i$. Let $T$ be a tree of height two. The root $g$ has $m$ children labeled $S_1, \ldots, S_m$, each of
which has $3 + B \cdot M$ children of its own, where $M = (5mB)^{\lceil (1-\epsilon)/\epsilon \rceil}$. Since 3-partition is strongly $\mathcal{NP}$-complete,
we may assume that $B$ is bounded by a polynomial in $m$, so $T$ has polynomial size. Each node $S_i$ is connected to each
of its $3 + B \cdot M$ children by links of capacity $B \cdot M + 2$, and the root is connected to each of the $S_i$ by links of capacity 6.

We now define $R = R^{I} \cup R^{II}$. Let $V = V^1 \cup V^2$, where $V^1 = \{v^1_1, \ldots, v^1_{3m}\}$ and $V^2 = \{v^2_1, \ldots, v^2_{mBM}\}$, be a
set of VMs organized in a tree as follows. First, for $j = 1, \ldots, 3m + 1$ let $q_j = \sum_{i=0}^{j-1} s_i$, where we set $s_0 = 0$ for
convenience. We now define bandwidth requirements between VMs in $V$. Each $v^1_i \in V^1$ requires chatter connections
of bandwidth 1 to $v^2_{M q_i + 1}, v^2_{M q_i + 2}, \ldots, v^2_{M q_{i+1}}$. Also, $v^1_i$ requires a chatter connection to $v^1_{i-1}$ if $i > 1$ and
$v^1_{i+1}$ if $i < 3m$. Finally, both $v^1_1$ and $v^1_{3m}$ require uplink connections to gateway $g$ of bandwidth 1. Thus, the request
topology is a tree consisting of stars on $s_i \cdot M$ nodes with centers $v^1_i$, for each $i = 1, \ldots, 3m$. Adjacent centers of stars
(i.e. $v^1_i$ and $v^1_j$ for $|i - j| = 1$) are connected to each other.

If $S$ admits a 3-partition, then there exists an embedding of congestion at most 1: assign the corresponding three
centers and their children to the children of $S_j$ for $j = 1, \ldots, m$, which is possible since each $S_j$ has exactly $3 + B \cdot M$
children. The congestion is at most 1 since the edges of $T$ incident on the nodes where the centers are mapped
will carry load exactly $B \cdot M + 2$ ($B \cdot M$ unit bandwidth connections to the children as well as two connections to
neighboring centers or uplink connections), and the edges $(S_j, g)$ will carry at most 2 units from each of the 3 centers
mapped to the children of $S_j$, yielding congestion at most 1.

Now suppose that $S$ does not admit a 3-partition. Consider the node $S_j \in T$ with the maximum number of centers
mapped to its children. Denote these centers by $v^1_{j_1}, \ldots, v^1_{j_k}$, where $k \ge 3$. We then have $\sum_{i=1}^{k} s_{j_i} \ge B + 1$, and
hence at least $M$ children of $v^1_{j_1}, \ldots, v^1_{j_k}$ are mapped outside the set of children of $S_j$. Hence, at least $M$ edges from
the centers $v^1_{j_1}, \ldots, v^1_{j_k}$ to these children congest the edge $(S_j, g)$, where $g$ is the root of $T$. Thus, the congestion is at
least $M/6$. The number of vertices in the tree $T$ is $n = 1 + m(3 + B \cdot M) < 1 + (3 + B) m \cdot M < (5mB) \cdot M \le
M^{\epsilon/(1-\epsilon) + 1} = M^{1/(1-\epsilon)}$. Hence, the congestion is at least $M/6 \ge n^{1-\epsilon}/6$.

We have shown that it is $\mathcal{NP}$-hard to distinguish between instances of the problem where the minimum congestion
is 1 and $\Omega(n^{1-\epsilon})$, thus completing the proof. □

5 Algorithm 

Next we present our algorithmic results and show that despite the $\mathcal{NP}$-completeness results in the previous section,
many practical instances can still be solved efficiently.

5.1 Creation of binary tree 

We first convert the tree $N$ into a binary tree $T$ with not many additional nodes in a way that preserves the congestion
of all solutions. This step is purely for convenience in simplifying the presentation of the algorithm that follows.
We simply replace each degree $d$ node with a complete binary tree on $d$ leaves. Algorithm 1 describes the procedure
Create-Binary-Tree($N$, $g$) more formally.



Algorithm 1: Create-Binary-Tree($N$, $g$)
for all $v \in N$ with degree($v$) > 3 do
    Let $u_1, \ldots, u_d$ be the children of $v$, and $e_1, \ldots, e_d$ the edges connecting $v$ to $u_i$
    Replace $e_1, \ldots, e_d$ with a binary tree rooted at $v$ with leaves $u_1, \ldots, u_d$
    Set the capacity of the parent edge of each $u_i$ to be $c_{e_i}$ and that of all other new edges to be $\infty$
end for



Let $T$ be the resulting binary tree. We first show that the congestion of embedding into $T$ and $N$ is equal:

Lemma 4. The congestion of embedding any request graph $G_R$ into a tree $N$ rooted at node $g$ is equal to the
congestion of embedding $G_R$ into the binary tree $T$ constructed by the procedure Create-Binary-Tree($N$, $g$).

Proof. Consider any embedding $\pi$ of $G_R$ into $N$. Since the auxiliary nodes inserted are not leaves, $\pi$ defines an
embedding of $G_R$ into $T$ as well. Let $u, v \in N \cap T$ and let $P^N_{uv}$, $P^T_{uv}$ be the sets of edges on the unique paths between $u$ and $v$
in $N$ and $T$. Observe that $P^N_{uv} \subseteq P^T_{uv}$, and that all edges in $P^T_{uv} \setminus P^N_{uv}$ have infinite capacity and contribute nothing
to the congestion. Hence the congestion of embedding in $N$ and $T$ is equal. □

Next, we show that $T$ is not much bigger than $N$:

Lemma 5. The number of nodes in $T$ is at most $2n$ and the height of $T$ is $O(H \log \Delta)$, where $\Delta$ is the maximum degree in
$N$ and $H$ denotes the height of $N$.

Proof. We replace each node $v$ of degree $d$ with a complete binary tree on $d$ leaves, which has at most $2d$ nodes.
Therefore, the number of nodes in $T$ is at most $2n$. Also, by this replacement we stretch sub-trees of height 1 by a
factor of at most $\lceil \log \Delta \rceil$, which shows that the height of $T$ is $O(H \log \Delta)$. □
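One possible way to realize the binarization step is sketched below. The representation (a `children` map, capacities keyed by child node, tuple-named auxiliary nodes) is assumed for illustration, and the sketch uses a balanced split rather than a strictly complete binary tree, which still keeps the added depth logarithmic in the fan-out.

```python
import itertools

def create_binary_tree(children, capacity, root):
    """Return (new_children, new_capacity) in which every node has <= 2 children.

    children: dict node -> list of child nodes (leaves may be absent)
    capacity: dict child -> capacity of the edge to its parent
    Auxiliary nodes are named ("aux", i); their parent edges get infinite
    capacity so they never contribute to congestion.
    """
    INF = float("inf")
    new_children, new_capacity = {}, dict(capacity)
    counter = itertools.count()

    def attach(node, kids):
        # Hang `kids` below `node`, inserting auxiliary nodes when needed.
        if len(kids) <= 2:
            new_children[node] = list(kids)
            return
        mid = (len(kids) + 1) // 2
        halves = []
        for part in (kids[:mid], kids[mid:]):
            if len(part) == 1:
                halves.append(part[0])  # original edge keeps its capacity
            else:
                aux = ("aux", next(counter))
                new_capacity[aux] = INF
                attach(aux, part)
                halves.append(aux)
        new_children[node] = halves

    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        attach(v, kids)
        stack.extend(kids)
    return new_children, new_capacity
```

Original nodes keep their own uplink capacities, so the congestion of any embedding is unchanged, in line with Lemma 4.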

5.2 Minimum congestion of embedding requests in a binary tree 

Now we present our primary algorithmic result and show that if the request graph is small (which is true in many
practical instances) then the optimal embedding can be found efficiently. Before describing the algorithm, we
introduce some notation. For any node $u \in T$ we use the symbol $e_u$ to denote the link joining the parent of node $u$ to $u$,
and $T_u$ to denote the subtree of $T$ rooted at $u$. If $u$ is not a leaf, we refer to the "left" and "right" children of $u$ in $T$ as
$u_l$ and $u_r$ respectively. In this section we assume that the tree $T$ rooted at $g$ is binary and of height $H$. Let $L^j$ denote
the set of vertices in $T$ at distance $j$ from $g$, so $L^0 = \{g\}$, while $L^H$ denotes the leaves at the lowest level.

The algorithm is straightforward dynamic programming. Starting at the leaves of $T$, and moving upwards towards
the root, for each node $u \in T$ and set $S \subseteq V$ we calculate the congestion of the optimal embedding of the VMs in $S$
into $T_u$ using the congestion of embeddings into $u$'s children. Let Flow[$S$] denote the sum of the bandwidth requirements
crossing the cut $(S, V \cup \{g\} \setminus S)$ in $G_R$, and Cong[$u$, $S$] denote the optimal congestion of the edges of $T_u$ when
embedding the subgraph of $G_R$ spanned by $S$ into $T_u$. Then Cong[$u$, $S$] satisfies the recurrence

$$\mathrm{Cong}[u, S] = \min_{S_l \subseteq S} \max \left\{ \mathrm{Cong}[u_l, S_l],\ \mathrm{Cong}[u_r, S \setminus S_l],\ \mathrm{Flow}[S_l]/c_{e_{u_l}},\ \mathrm{Flow}[S \setminus S_l]/c_{e_{u_r}} \right\}.$$

That is, it is the minimum over all partitions $(S_l, S \setminus S_l)$ of $S$ of the congestion of embedding $S_l$ into $T_{u_l}$ and $S \setminus S_l$
into $T_{u_r}$. The terms $\mathrm{Flow}[S_l]/c_{e_{u_l}}$ and $\mathrm{Flow}[S \setminus S_l]/c_{e_{u_r}}$ are the congestion on the edges connecting $u$ to its children.
The base case is when $u$ is a leaf, in which case



$$\mathrm{Cong}[u, S] = \begin{cases} 0 & \text{if } |S| \le 1, \\ \infty & \text{if } |S| > 1, \end{cases}$$

assuming for simplicity that each server can support at most a single VM. By changing this equation, we can easily
allow a server $u \in T$ to support up to $n_u$ VMs.

After computing these recurrences, the algorithm outputs Cong[$g$, $V$]. Note that $L^0 = \{g\}$ and that it suffices to
compute Cong[$g$, $V$] (i.e., Cong[$g$, $S$] for subsets $S \subsetneq V$ is not needed). Algorithm 2 shows the procedure in more
detail.



Algorithm 2: Minimum Congestion
Input: Binary tree $T$ rooted at $g$, request graph $G_R$
Output: Minimum congestion in embedding $V$ into $T$ such that requirements $R$ are satisfied
for all $S \subseteq V$ do
    Flow[$S$] ← $\sum_{v_i \in S} f_{(v_i, g)} + \sum_{v_i \in S,\, v_j \notin S} f_{(v_i, v_j)}$
end for
for all leaves $u \in L$ and $S \subseteq V$ do
    Cong[$u$, $S$] ← 0 if $|S| \le 1$, $\infty$ otherwise
end for
for $j = H - 1, \ldots, 0$ do
    for all $u \in L^j$, $u$ not a leaf do
        for all $S \subseteq V$ do
            $t_{\min}$ ← $\infty$
            for all $S_l \subseteq S$ do
                $t$ ← max{Cong[$u_l$, $S_l$], Cong[$u_r$, $S \setminus S_l$], Flow[$S_l$]/$c_{e_{u_l}}$, Flow[$S \setminus S_l$]/$c_{e_{u_r}}$}
                if $t < t_{\min}$ then
                    $t_{\min}$ ← $t$
                    $S_{\min}$ ← $S_l$
                end if
            end for
            Cong[$u$, $S$] ← $t_{\min}$
            Part[$u$, $S$] ← $(S_{\min}, S \setminus S_{\min})$
        end for
    end for
end for
return Cong[$g$, $V$]



When we update Cong[$u$, $S$] we also store the partition $(S_l, S \setminus S_l)$ that realizes this optimal congestion in a
partition table Part[$u$, $S$]. After the execution of the algorithm, we can recover the optimal embedding by working
backwards in the standard fashion for dynamic programs: starting at $g$ we read the optimal partition $(V_l, V \setminus V_l)$ from
Part[$g$, $V$]. Now we find the optimal partitions of $V_l$ with root $g_l$ and $V \setminus V_l$ with root $g_r$, and so on.
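The dynamic program and the backward recovery of the embedding can be sketched with bitmask-encoded subsets as follows. This is an illustrative implementation under stated assumptions, not the authors' code: VMs are numbered from 0, the tree is given as a `children` map with capacities keyed by child node, all capacities are assumed positive, and nodes with a single child are handled as a convenience.

```python
import math

def min_congestion(children, capacity, root, k, uplink, chatter):
    """Exact DP over subsets of the k VMs.

    children: node -> list of at most two children (leaves absent or [])
    capacity: child -> capacity of the edge to its parent (assumed > 0)
    uplink:   list of k type-I bandwidths; chatter: dict (i, j) -> bandwidth
    Returns (optimal congestion, dict VM -> leaf), or (inf, None).
    """
    size = 1 << k
    # Flow[S]: bandwidth crossing the cut (S, (V + {g}) \ S).
    flow = [0.0] * size
    for S in range(size):
        total = sum(uplink[i] for i in range(k) if S >> i & 1)
        for (i, j), f in chatter.items():
            if (S >> i & 1) != (S >> j & 1):
                total += f
        flow[S] = total

    cong, part = {}, {}

    def solve(u):
        kids = children.get(u, [])
        for c in kids:
            solve(c)
        table, split = [math.inf] * size, [0] * size
        for S in range(size):
            if not kids:  # leaf: hosts at most one VM
                table[S] = 0.0 if bin(S).count("1") <= 1 else math.inf
                continue
            left = kids[0]
            right = kids[1] if len(kids) == 2 else None
            Sl = S
            while True:  # enumerate every submask Sl of S
                Sr = S ^ Sl
                if right is None and Sr:
                    t = math.inf  # nowhere to place the remaining VMs
                else:
                    t = max(cong[left][Sl], flow[Sl] / capacity[left])
                    if right is not None:
                        t = max(t, cong[right][Sr], flow[Sr] / capacity[right])
                if t < table[S]:
                    table[S], split[S] = t, Sl
                if Sl == 0:
                    break
                Sl = (Sl - 1) & S
        cong[u], part[u] = table, split

    solve(root)
    full = size - 1
    if math.isinf(cong[root][full]):
        return math.inf, None
    # Walk the partition table downwards to recover the embedding.
    embedding, stack = {}, [(root, full)]
    while stack:
        u, S = stack.pop()
        kids = children.get(u, [])
        if not kids:
            embedding.update({i: u for i in range(k) if S >> i & 1})
            continue
        Sl = part[u][S]
        stack.append((kids[0], Sl))
        if len(kids) == 2:
            stack.append((kids[1], S ^ Sl))
    return cong[root][full], embedding
```

The submask loop `Sl = (Sl - 1) & S` visits each subset of `S` exactly once, which is what yields the $3^k$ factor in the running time.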
Now we analyze the correctness and runtime: 

Lemma 6. Algorithm 2 finds the minimum congestion of embedding request $G_R$ in a tree network $N$.

Proof. By Lemma 4, optimizing the congestion on $N$ is equivalent to optimizing it on $T$. The optimal congestion of an
embedding restricted to $T_u$ requires using an optimal partition into subsets embedded into the left and right subtrees of $T_u$,
and Algorithm 2 recursively computes the optimal embedding for all possible partitions of the VMs, thus retrieving
the congestion of the optimal embedding. □

Lemma 7. Algorithm 2 has running time $O(3^k n)$.

Proof. We first calculate Flow[$S$] for every set $S \subseteq V$. There are $2^k$ such sets, and each requires summing over at
most $k^2$ edges in $R$, for a runtime of $O(k^2 2^k)$, which is $O(3^k)$ for large enough $k$. In the main loop, for each $u$ in $T$
we compute Cong[$u$, $S$] for all sets $S \subseteq V$. If $|S| = i$, computing Cong[$u$, $S$] requires looking at all $2^i$ subsets of $S$
and doing $O(1)$ work for each one. Summing over all $O(n)$ nodes and all sets $S$, this requires $O(n) \cdot O\left(\sum_{i=0}^{k} \binom{k}{i} 2^i\right) =
O(3^k n)$ work in total. □
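The closed form of the sum in the proof of Lemma 7 follows from the binomial theorem; the short derivation below spells this out, grouping the subsets $S \subseteq V$ by their size $i$.

```latex
\[
\sum_{S \subseteq V} 2^{|S|}
  \;=\; \sum_{i=0}^{k} \binom{k}{i}\, 2^{i}
  \;=\; \sum_{i=0}^{k} \binom{k}{i}\, 2^{i}\, 1^{k-i}
  \;=\; (2 + 1)^{k}
  \;=\; 3^{k}.
\]
```

For instance, with $k = 5$ each tree node examines $3^5 = 243$ (set, subset) pairs in total.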



5.3 Other Objective Functions and Request Models 

The basic form of our algorithm is not specific to congestion, and the recurrence in Algorithm 2 can easily be modified
to optimize for any objective function for which we can write a similar recurrence. For instance, if each edge in T has 
a delay and bandwidth capacity, we can minimize the average or maximum latency between VMs subject to satisfying 
bandwidth constraints (with a slightly more complex recurrence). 

In practice it may not be desirable to allow request graphs to have arbitrary topologies and edge weights. If a 
request graph is sufficiently simple and uniform, then the complexity results of Section 4 no longer apply, and we no
longer need to consider all $2^k$ cuts of $G_R$ at each node. For instance, if $G_R$ is a clique with equal bandwidth on all
edges, then the congestion of embedding a set of VMs $S$ into $T_u$ depends only on the size of $S$, so we only need
to consider $k + 1$ subproblems for each node in $T$.

Ballani et al. [BCKR11] describe a virtual cluster request model, which consists of requests of the form $(k, B)$
representing $k$ VMs each connected to a virtual switch with a link of bandwidth $B$. Such a request $(k, B)$ is similar
(but not identical) to a request consisting of a clique on $k$ VMs and a bandwidth guarantee of $B/(k - 1)$ on each edge
of the clique in our setting. We show that when restricted to virtual cluster requests, a modified version of Algorithm
2 finds the minimum congestion embedding in time $O(nk^2)$. For the sake of completeness and comparison with
their work, we present Algorithm 3. Similar adjustments could be made to handle other request models for which
considering all $2^k$ cuts of the request graph is unnecessary.

The correctness of Algorithm 3 can be inferred from the correctness of Algorithm 2 by noting that under the virtual
cluster request model, all subsets of equal size embed in a subtree with the same congestion, i.e. for any $S_1, S_2 \subseteq V$ such
that $|S_1| = |S_2|$, we have Cong[$u$, $S_1$] = Cong[$u$, $S_2$] for all $u \in T$. For every node $u$ and for all $z \in \{0, \ldots, k\}$, Algorithm
3 calculates Cong[$u$, $z$] by optimizing over the $z + 1$ possible splits of the $z$ VMs among its children. A simple recursive
calculation shows that this computation has complexity $\sum_{z=0}^{k} (z + 1) = O(k^2)$. This shows that the running time of
Algorithm 3 is $O(nk^2)$.
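For the virtual cluster case, the table at each node only needs one entry per count of VMs. A minimal sketch, assuming the clique interpretation with $B/(k-1)$ on each edge, $k \ge 2$, positive capacities and the same hypothetical `children`/`capacity` layout as a dictionary-based tree, might look as follows.

```python
import math

def min_congestion_cluster(children, capacity, root, k, B):
    """Minimum congestion for a (k, B) request under the clique interpretation.

    cong[u][z] is the best congestion when exactly z VMs are placed in T_u.
    children: node -> list of at most two children (leaves absent or [])
    capacity: child -> capacity of the edge to its parent (assumed > 0)
    """
    cong = {}

    def cut(z):
        # z VMs inside, k - z outside: z * (k - z) clique edges cross the cut.
        return z * (k - z) * B / (k - 1)

    def solve(u):
        kids = children.get(u, [])
        if not kids:  # leaf: hosts at most one VM
            cong[u] = [0.0 if z <= 1 else math.inf for z in range(k + 1)]
            return
        for c in kids:
            solve(c)
        left = kids[0]
        right = kids[1] if len(kids) == 2 else None
        table = []
        for z in range(k + 1):
            best = math.inf
            for i in range(z + 1):  # i VMs go left, z - i go right
                if right is None:
                    if i != z:
                        continue
                    t = max(cong[left][i], cut(i) / capacity[left])
                else:
                    t = max(cong[left][i], cong[right][z - i],
                            cut(i) / capacity[left],
                            cut(z - i) / capacity[right])
                best = min(best, t)
            table.append(best)
        cong[u] = table

    solve(root)
    return cong[root][k]
```

Each node does $\sum_{z=0}^{k}(z+1)$ constant-time steps, matching the $O(nk^2)$ bound stated above.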



6 Simulations 

In this section we present simulations which verify the correctness and scaling properties of Algorithm |2] in both the 
pairwise bandwidth guarantees model, as well as virtual cluster request model 153] We perform all simulations using 
an unoptimized python implementation of Algorithm |2] on Intel Sandy Bridge Quad Core machine having 4 GB of 
RAM using the networkx graph package [net I to simulate the physical network. 
Algorithm 3: Min Congestion Embedding for (k, B)

for all leaves u ∈ L and i ∈ 0, ..., k do
    Cong[u, i] ← 0 if i ≤ 1, ∞ otherwise
end for
for j = height(T) - 1, ..., 1 do
    for all u ∈ U_j do
        for z = 0, ..., k do
            t_min ← ∞
            for i = 0, ..., z do
                f_l ← i · (k - i) · B/(k - 1)
                f_r ← (z - i) · (k - z + i) · B/(k - 1)
                t ← max{Cong[u_1, i], Cong[u_2, z - i], f_l/c_{e_l}, f_r/c_{e_r}}
                if t < t_min then
                    t_min ← t
                    i_min ← i
                end if
            end for
            Cong[u, z] ← t_min
            Part[u, z] ← (i_min, z - i_min)
        end for
    end for
end for
return Cong[g, k]
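
As an illustration, the dynamic program of Algorithm 3 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the tree is assumed to be binary, each edge is identified by its child endpoint, all residual capacities are assumed positive, and $k \geq 2$. The names `min_congestion_cluster`, `children`, and `capacity` are hypothetical, and recursion replaces the level-by-level loop.

```python
import math

def min_congestion_cluster(children, capacity, root, k, B):
    """Minimum congestion for a virtual cluster (k, B) on a binary tree.

    children: dict mapping each internal node to its (left, right) children;
              nodes absent from the dict are leaves (servers).
    capacity: dict mapping each non-root node to the residual capacity of the
              edge joining it to its parent (assumed > 0).
    Returns (congestion at the root, table of chosen splits).
    """
    cong = {}  # cong[u][z]: best congestion when z VMs are placed under u
    part = {}  # part[u][z]: (VMs sent left, VMs sent right) achieving it

    def solve(u):
        if u not in children:
            # A server hosts at most one VM.
            cong[u] = [0.0 if z <= 1 else math.inf for z in range(k + 1)]
            return
        left, right = children[u]
        solve(left)
        solve(right)
        cong[u] = [math.inf] * (k + 1)
        part[u] = [None] * (k + 1)
        for z in range(k + 1):
            for i in range(z + 1):
                # Traffic crossing each child edge: VMs below times VMs above,
                # at B/(k - 1) per VM pair.
                f_l = i * (k - i) * B / (k - 1)
                f_r = (z - i) * (k - z + i) * B / (k - 1)
                t = max(cong[left][i], cong[right][z - i],
                        f_l / capacity[left], f_r / capacity[right])
                if t < cong[u][z]:
                    cong[u][z] = t
                    part[u][z] = (i, z - i)

    solve(root)
    return cong[root][k], part

# Toy instance (values are made up): root g, switches a and b, four servers.
children = {"g": ("a", "b"), "a": ("s1", "s2"), "b": ("s3", "s4")}
capacity = {"a": 5, "b": 20, "s1": 10, "s2": 10, "s3": 20, "s4": 20}
best, part = min_congestion_cluster(children, capacity, "g", k=2, B=10)
print(best, part["g"][2])  # 0.5 (0, 2): both VMs go under switch b
```

The `part` table plays the role of Part[u, z]: following it from the root downward recovers which servers receive VMs.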



[Figure 1(a): Running time vs. network size; x-axis: number of servers (leaves of T), 200 to 2000. Linear variation with n.]
[Figure 1(b): Running time vs. request size; x-axis: request size (k). Exponential variation with k.]

Figure 1: Pairwise bandwidth guarantees between all VMs: dependence of the running time of Algorithm 2 on (a)
$n$, the size of the network, when $k = 5$, and (b) $k$, the size of the request, when $n = 100$.



6.1 Network configuration 

In order to test our algorithm on realistic networks, we simulate a typical three-tier data center network [AFLV08]
with servers housed in racks, each connected to a Top-of-Rack (TOR) switch (tier I). The TOR switches connect
the racks to other parts of the network via Aggregation Switches (AS, tier II). The AS switches have uplinks connecting
them to the Core Switch (CS, tier III). We assume that TORs are connected to the servers with 10 Gbps links, while
the uplinks from the TORs to the ASs are 40 Gbps and those from the ASs to the CS are 100 Gbps. We construct a tree
topology over these elements, recalling that common routing protocols used in practice employ a spanning tree of
the physical network. We model existing traffic in the data center network using random residual capacities for each
link. We choose the residual capacity for edge $e$ independently of all other edges and uniformly at random from
$[0, c(e)]$, where $c(e)$ denotes the bandwidth capacity of edge $e$. The choice of random residual link capacities is
forced on us by the lack of models describing realistic network flows in a data center. We note that Algorithm 2 finds
the optimal congestion embedding for any distribution of residual capacities on the network links and any choice
of bandwidth capacities of the links.

Table 1: Linear scan for VM allocation: run time

n      k    Time (hours)
50     4    2.2
75     4    18.8
100    4    80
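
The network configuration of Section 6.1 can be sketched roughly as below. This is a standard-library stand-in rather than the networkx-based setup used in the simulations. The function name `build_three_tier`, its parameters, and the node-naming scheme are hypothetical. The capacities 10, 40, and 100 are the link speeds quoted in the text, and a single core switch is assumed so that the result is a tree.

```python
import random

def build_three_tier(num_as, tors_per_as, servers_per_tor, seed=0):
    """Build a three-tier tree: core switch -> AS -> TOR -> servers.

    Returns two dicts keyed by (parent, child) edge:
      capacity[e]: nominal link capacity c(e)
      residual[e]: residual capacity drawn uniformly at random from [0, c(e)]
    """
    rng = random.Random(seed)
    capacity, residual = {}, {}

    def add_link(parent, child, cap):
        capacity[(parent, child)] = cap
        # Each edge is sampled independently of all others.
        residual[(parent, child)] = rng.uniform(0, cap)

    for a in range(num_as):
        as_node = f"AS{a}"
        add_link("CS", as_node, 100)         # AS uplink to the core switch
        for t in range(tors_per_as):
            tor = f"{as_node}-TOR{t}"
            add_link(as_node, tor, 40)       # TOR uplink to its AS
            for s in range(servers_per_tor):
                add_link(tor, f"{tor}-S{s}", 10)  # server link

    return capacity, residual

capacity, residual = build_three_tier(num_as=2, tors_per_as=3, servers_per_tor=4)
print(len(capacity), "links in the tree")
```

A draw of exactly zero residual capacity is possible in principle. Code that divides by residual capacity would need to guard against that case.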

6.2 Linear scan over all possible VM allocations 

By implementing a linear scan over all possible VM allocations in the network and finding the allocation that
minimizes congestion, we verify the correctness of Algorithm 2. Note that this implementation requires scanning
$\binom{n}{k} \cdot k! = O(n^k)$ feasible VM allocations, where $n$ denotes the number of servers in the network and
$k$ denotes the request size. Hence we choose small network and request sizes, $n \in \{50, 75, 100\}$ and $k = 4$, and
verify the correctness of the algorithm for different request topologies and randomly generated residual capacities on
the network links. We observe that this procedure requires hours or even days to finish even for very small network
and request sizes like $n = 125$ and $k = 4$, as seen in Table 1, and hence is infeasible for modern data centers
containing hundreds of thousands of servers. In contrast, Algorithm 2 has complexity $O(3^k n)$, which is linear in the
network size $n$, and, as shown in the next subsection, finishes on the order of seconds on our simulation setup for
small values of $k$.
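
To make the cost of the linear scan concrete, here is a hedged sketch of what such a brute-force check could look like. It is not the authors' code. The names `path_edges`, `brute_force`, and `num_allocations` are hypothetical, each tree edge is identified by its child endpoint, and the toy capacities are made up.

```python
import itertools
import math

def num_allocations(n, k):
    """Number of ways to place k distinct VMs on n servers: C(n, k) * k!."""
    return math.perm(n, k)

def path_edges(parent, a, b):
    """Edges (named by their child endpoint) on the tree path from a to b."""
    chain = []
    x = a
    while x is not None:          # walk from a up to the root
        chain.append(x)
        x = parent.get(x)
    on_chain = set(chain)
    edges = []
    x = b
    while x not in on_chain:      # walk from b up to the lowest common ancestor
        edges.append(x)
        x = parent[x]
    for y in chain:               # add a's side of the path, below the ancestor
        if y == x:
            break
        edges.append(y)
    return edges

def brute_force(parent, capacity, servers, request):
    """Scan every allocation; request maps (vm_a, vm_b) -> bandwidth."""
    vms = sorted({v for pair in request for v in pair})
    best, best_alloc = math.inf, None
    for combo in itertools.permutations(servers, len(vms)):
        alloc = dict(zip(vms, combo))
        load = {}
        for (a, b), bw in request.items():
            for e in path_edges(parent, alloc[a], alloc[b]):
                load[e] = load.get(e, 0.0) + bw
        cong = max((load[e] / capacity[e] for e in load), default=0.0)
        if cong < best:
            best, best_alloc = cong, alloc
    return best, best_alloc

parent = {"a": "g", "b": "g", "s1": "a", "s2": "a", "s3": "b", "s4": "b"}
capacity = {"a": 5, "b": 20, "s1": 10, "s2": 10, "s3": 20, "s4": 20}
print(num_allocations(50, 4))  # 5527200 allocations for n = 50, k = 4
print(brute_force(parent, capacity, ["s1", "s2", "s3", "s4"], {("v1", "v2"): 10}))
```

Each allocation also costs a path computation per request edge, so the scan quickly becomes impractical as $n$ grows.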

6.3 Pairwise bandwidth requirements 

Next, we verify the scaling properties of Algorithm 2 with respect to the parameters $n$ and $k$. First, we fix a request
of size $k = 5$ and plot the running time for increasing values of $n$, the number of servers, from $n = 200$ to
$n = 2000$ in Figure 1(a), which illustrates the linear variation of run time with respect to $n$. Next, we fix the
network size to $n = 100$ and plot the run time for path requests with lengths from $k = 4$ to $k = 10$ in Figure 1(b).
This figure shows that the run time increases exponentially with respect to $k$.
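
One way to see where a factor of $3^k$ can come from is sketched below. It assumes that the dynamic program considers, at each node, every subset $S$ of the $k$ VMs together with every way of splitting $S$ between two children. This is an assumption about Algorithm 2, which is not reproduced in this section. The function name `count_splits` is hypothetical; subsets are encoded as bitmasks and sub-subsets are enumerated with the usual submask loop.

```python
def count_splits(k):
    """Count pairs (S, S') with S' a subset of S, over all subsets S of k VMs."""
    total = 0
    for s in range(1 << k):        # every subset S, as a bitmask
        sub = s
        while True:                # every submask S' of S, including 0
            total += 1
            if sub == 0:
                break
            sub = (sub - 1) & s
    return total

for k in (4, 10):
    print(k, count_splits(k), 3 ** k)
```

Under that assumption, the count equals $\sum_{j=0}^{k} \binom{k}{j} 2^j = 3^k$, so going from $k = 4$ to $k = 10$ multiplies the per-node work by $3^6 = 729$.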

6.4 Virtual Cluster Request Model 

We also verify the scaling properties when the requests are restricted to the virtual cluster model [BCKR11]. For a
fixed request $(k, B)$ where $k = 100$ and $B = 100$ Mbps, we plot the running time for increasing values of $n$, from
$n = 200$ to $n = 2000$, in Figure 2(a), which illustrates the linear variation of run time with $n$. Next, we fix the
network size to $n = 1000$ and plot the run time for $k$ in the range 10 to 100. These results show that for virtual
cluster requests, our algorithm finds the minimum congestion embedding in time $O(nk^2)$.

As mentioned before, a number of heuristic approaches have been formulated to perform VM allocation. However,
the lack of models of the existing network flows inside a data center, especially in the context of enterprise workloads,
hinders the evaluation and comparison of their performance in realistic settings. In particular, we observe that by
congesting particular edges in the network, it is possible to make the greedy heuristics for VM mapping perform
significantly worse than the optimal embedding (output by Algorithm 2). However, a thorough comparison with
heuristics requires models of flow in a data center serving enterprise requests, and we leave this to future work.



[Figure 2(a): Running time vs. network size; x-axis: number of servers (leaves of T), 200 to 2000. Linear variation with n.]
[Figure 2(b): Running time vs. request size; x-axis: request size (k). Quadratic variation with k.]

Figure 2: Virtual cluster request model: dependence of the running time of Algorithm 3 for the virtual cluster request
model on (a) $n$, the size of the network, when $k = 10$, and (b) $k$, the size of the request, when $n = 1000$.



7 Conclusion and Future Work

In this paper we study the problem of allocating a graph request within a tree topology, and we present an $O(3^k n)$
dynamic programming algorithm that embeds the resource request graph of size $k$ into the data center topology (tree)
of size $n$ to minimize congestion. We believe this is useful for enterprise workloads, where the request size $k$ is small.
For clique requests, we present an $O(nk^2)$ dynamic programming algorithm to allocate clusters of size $k$ in a tree of
size $n$ for minimizing congestion, which could be useful for MapReduce-like workloads. We believe that it would
also be possible to extend our results to hybrid workloads involving tiers of VMs, with both inter-tier and
intra-tier bandwidth guarantees. We also provide hardness results and show that the problem of finding a minimum
congestion embedding in a network remains NP-hard even under the restriction to tree networks. We focus on
minimizing congestion as our objective function, but we believe our methods are applicable to a wider class of metrics
and objective functions.